With respect to accuracy, one of the well-known techniques for human
detection is the histogram of oriented gradients (HOG) method. Unfortunately,
the HOG feature calculation is highly complex and computationally
intensive. Thus, in this research, we aim to achieve a resource-efficient and
low-power HOG hardware architecture while maintaining a high frame rate
for real-time processing. This paper introduces a hardware architecture for
human detection in 2D images using a simplified HOG algorithm.
To increase the frame rate, we simplify the HOG computation
while maintaining the detection quality. In the hardware architecture, we
design a cell-based processing method instead of a window-based method.
Moreover, a 64-pixel parallel and pipelined architecture is used to increase the
processing speed. Our pipeline architecture significantly reduces memory
bandwidth and avoids any external memory utilization. An Altera DE2-115
field-programmable gate array (FPGA) board was employed to evaluate the
design. The evaluation results show that our design achieves up
to 86.51 frames per second (Fps) at a relatively low operating
frequency (27 MHz). It consumes 48,360 logic elements (LEs) and 4,363
registers. The performance test results reveal that the proposed solution
exhibits a trade-off between Fps, clock frequency, the use of registers, and
Fps-to-clock ratio.
1. INTRODUCTION


Image processing has been widely utilized to detect humans [1], [2]. However, detecting humans in
an image is challenging because their appearance varies widely [3]. To recognize the
presence of humans accurately, extensive computations are required. The system should be highly accurate
while maintaining low energy and low resource consumption in real-time processing. With respect to
accuracy, one of the well-known techniques for human detection is the histogram of oriented gradients
(HOG) method, which was initially proposed in [3] in 2005. In that publication, the
HOG technique was shown to significantly outperform other existing human detection techniques. Since
then, many researchers have modified the HOG technique to increase the detector performance. Moreover,
since HOG offers promising accuracy, combining it with other classification
algorithms to distinguish humans under challenging conditions such as illumination, rotation, and even
deformation is also promising [4]. Besides human detection, HOG has been widely applied in various
cases [5]-[7], such as pedestrian detection, classical dance classification [8], vehicle detection [9], traffic sign
detection, crowd density estimation, general object detection, object tracking, feature matching, feature
descriptors [10], anomaly detection, digit recognition, and so on.

HOG-based human detection has been explored in many aspects. For example, a cascade-of-rejectors
approach, which is usually utilized for face recognition, was combined with HOG features by [11] to get
better accuracy. Jia and Zhang [12] combined the HOG method with Viola's face detection framework to
perform real-time human detection. Zhang et al. [13] reported a computational cost reduction
using a multi-resolution framework. Wang et al. [14] combined the HOG method with local binary patterns
(LBP) as the feature sets in the human detection system. Schwartz et al. [15] utilized partial least squares
analysis to provide a richer descriptor, similar to edge-based features that utilize additional color and
texture information. There was also research that attempted to combine HOG with a human body ratio
estimation technique to distinguish the human from the nonhuman category [16]. However, most of these works
on HOG aimed at improving accuracy by combining
HOG features with other potential techniques. As a result of these combinations, the accuracy
improvements tend to come with higher computational cost and complexity than the original HOG
algorithm. Reducing the computational resources is necessary because hardware implementation of HOG is
feasible [17].

Hardware implementation offers better speed and power efficiency to keep up with
real-time processing requirements [7]. Hence, it is expected to provide better performance than a software
implementation. Many works, as in [18]-[24], utilized a field-programmable gate array (FPGA) to implement
HOG in hardware, as it is able to accommodate parallel architectures and is suitable for real-time image
processing [18], [25]. Moreover, it keeps the design configurable and shortens the design
time-to-market [26]. Unfortunately, the HOG feature calculation is complicated [4]. Although hardware
implementation offers high-speed computation, it can consume excessive resources and power if not
appropriately designed. On the other hand, resource-efficient and low-power systems are currently in high
demand. Trends in electronics and applications are moving toward green technology, in which resource
and energy consumption are important aspects to be considered (i.e., kept as low as possible). For this reason,
in this paper, we designed a simplified HOG algorithm, a digital hardware architecture, and its FPGA
implementation. The proposed design targets low-power and resource-efficient characteristics.

Section 1 of this paper explains the research background and gives a brief overview of the HOG technique. In
section 2, the simplified algorithm is presented along with the equations that have been remodeled to avoid large
division operations. In section 3, we describe the hardware architecture that realizes the simplified algorithm,
followed by the FPGA implementation and its functional verification results.
Performance evaluation is also presented in section 3, and it is enriched with benchmark comparisons with other
techniques. Finally, we draw concluding remarks to highlight the research and its significant contribution.


2. MODIFIED HOG ALGORITHM

To reduce the computational complexity, we simplify the HOG-based human
detection algorithm to make it suitable for hardware implementation. Despite the computational reductions,
detection quality can still be maintained. The original HOG algorithm has high computational complexity
due to its division operations and intensive looping operations in its window-based processing. In its original
form, it is therefore better suited to software implementation. It has to be simplified and
modified to reduce its cost and power consumption before it is suitable for application-specific integrated circuit
(ASIC) implementation, where its redundancy and concurrency may be exploited
for parallel and pipeline processing (as addressed by [27]). The specifications of the designed hardware
architecture are presented in Table 1.

Following the basic idea of the HOG algorithm, the input image is divided into cells (C), blocks (B), and
windows (W) [3], as illustrated in Figure 1. The normalized gradients of these structures are eventually
collected over a window for person or non-person classification. Each cell consists of 8×8
pixels. Therefore, there are 80×60 cells within a frame. We index the cells in raster-scan order from 1 to 4,800
(C_1 to C_4800). Every 2×2 cells are grouped into a block. Each block shares 50% of its cell data with
its neighboring blocks, in both the horizontal and vertical directions. Therefore, we have 79×59 blocks
within a frame, indexed from 1 to 4,661 (B_1 to B_4661). The window consists of 8×16 cells, equivalent to
64×128 pixels. These choices of cell size, block size, window size, and block overlap give the lowest miss rate
compared with other sizes and overlaps [3]. The window is moved by one cell column or one cell row
after each evaluation. The pixels in the window are classified by the support vector machine (SVM) for
human detection. In total, 3,285 windows are analyzed in a 640×480 pixel image. To fit
hardware implementation, we modify the HOG algorithm by proposing cell-based processing, cell
derivatives with neighboring-edge anti-aliasing, magnitude calculation using a linear approach, fixed-weighted
binning, block normalization using the Newton-Raphson algorithm, block-wise SVM classification, and
fixed-point representation. The details of each technique are described in the following subsections.
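
The counts quoted above follow directly from the image and window geometry. The short Python sketch below reproduces them as a sanity check; it is only an illustration, and the function name `hog_layout` and its parameters are hypothetical, not part of the proposed hardware.

```python
def hog_layout(width=640, height=480, cell=8, win_cells=(8, 16), bins=9):
    """Derive the cell, block, and window counts from the frame geometry."""
    cells_x, cells_y = width // cell, height // cell        # 80 x 60 cells
    blocks_x, blocks_y = cells_x - 1, cells_y - 1           # 2x2 cells, 50% overlap
    windows_x = cells_x - win_cells[0] + 1                  # window slides by one cell
    windows_y = cells_y - win_cells[1] + 1
    blocks_per_window = (win_cells[0] - 1) * (win_cells[1] - 1)   # 7 x 15 blocks
    return {
        "cells": cells_x * cells_y,
        "blocks": blocks_x * blocks_y,
        "windows": windows_x * windows_y,
        "feature_size": blocks_per_window * 4 * bins,       # 4 cells per block
    }


layout = hog_layout()
print(layout)
```

With the default arguments this yields 4,800 cells, 4,661 blocks, 3,285 windows, and a 3,780-element feature vector, matching the values in the text and in Table 1.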


Table 1. System specification


Parameter                      Value
Image size                     640×480 pixels
Window (W) size                64×128 pixels
Block (B) size                 2×2 cells
Cell (C) size                  8×8 pixels
Bin size                       9 (0°-180°)
HOG feature size               3,780
Number of windows              3,285
Block rows per window          15
Block columns per window       7
Gradient binning               Linear approach to avoid Euclidean distance (L2-norm)
Histogram normalization        Manhattan distance (L1-norm) to avoid Euclidean distance (L2-norm)
Block overlap                  50%
System clock                   27 MHz

Figure 1. Illustration of the HOG image structure (disclaimer: the figure is not drawn to scale)


2.1. Cell-based processing 

Instead of window-based processing, we use cell-based processing to compute the derivative values
in the x-direction (dx) and in the y-direction (dy). Figure 2(a) and Figure 2(b) illustrate how
window-based and cell-based processing in raster scan are executed, respectively. Using cell-based processing,
we can greatly reduce the redundancy in derivative computation by skipping the overlapped cell data that
window-based processing computes repeatedly.

As shown in Figure 2(b), cell-based processing eliminates the large overlapped cell area, which results
in low computational complexity as well as low memory bandwidth requirements [28]. However, the
derivative values (dx and dy) are still identical to those of the original window-based HOG algorithm. This
method can be applied because we can reuse the computed cell derivative (dx and dy) data for different
windows instead of recalculating the derivatives of all the cells inside a window each time a new window is
evaluated. The proposed method differs from [22], where the computation of the overlapped data is done
using a complex pipeline stage. In our method, we store the calculated cell data in temporary random-access
memory (RAM). Each time the system analyzes a new window, it fetches the respective cell calculation
results from the RAM, hence avoiding unnecessary recalculation.
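
To give a feeling for the redundancy that cell reuse removes, the sketch below counts how many cell evaluations each scheme would need for one frame. It is a back-of-the-envelope Python illustration under the assumption that a naive window-based scheme recomputes every cell of every window; the variable names are hypothetical.

```python
CELLS_PER_FRAME = 80 * 60          # each cell is evaluated once in cell-based processing
CELLS_PER_WINDOW = 8 * 16          # a window spans 8 x 16 cells
WINDOWS_PER_FRAME = 3285           # value stated in the text

# Naive window-based processing: every window recomputes all of its cells.
window_based_evaluations = WINDOWS_PER_FRAME * CELLS_PER_WINDOW
cell_based_evaluations = CELLS_PER_FRAME

reduction = window_based_evaluations / cell_based_evaluations
print(window_based_evaluations, cell_based_evaluations, round(reduction, 1))
```

Under this assumption, the window-based scheme performs 420,480 cell evaluations per frame against 4,800 for the cell-based scheme, roughly an 88-fold difference.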
Figure 2. Comparison between two approaches on HOG: (a) Window-based raster scanning and
(b) Cell-based raster scanning


2.2. Cell derivatives with edge neighboring anti-aliasing

In the HOG algorithm, the derivative values (dx and dy) are computed for every pixel using
the convolution kernels in (1). Since we utilize cell-based calculation, there will be many edges within a window
[29]. The edges may produce invalid dx and dy values for pixels located at the cell borders, as these pixels
possess only one adjacent pixel instead of two. To combat this problem, we assign to the pixels in the edge
areas the dx and dy values of their neighboring pixels, as shown in Figure 3. We apply this method
to pixels located on both horizontal and vertical edges. As illustrated in Figure 3, grey-colored squares
represent pixels with distinct derivatives. Meanwhile, blue- and yellow-colored squares represent pixels with
identical derivatives as the result of duplicating the derivative values of adjacent pixels.


$$h_x = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}, \qquad h_y = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} \qquad (1)$$


Figure 3. The dx and dy values of a cell (8×8 pixels) at the edges are similar to the derivative values of their
neighbor pixels; the left panel marks pixels with identical dx and the right panel marks pixels with identical dy
(the color version of this image is available in the online article version)
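
The edge handling described above can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware implementation: it assumes the kernel is applied as a centered difference (right minus left, bottom minus top), and the function name `cell_derivatives` is hypothetical.

```python
def cell_derivatives(cell):
    """Return (dx, dy) for a square cell given as a list of rows of pixel values.

    Interior pixels use the centered difference of kernel [-1 0 1]; border
    pixels copy the derivative of their nearest interior neighbor.
    """
    n = len(cell)
    dx = [[0] * n for _ in range(n)]
    dy = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(1, n - 1):
            dx[r][c] = cell[r][c + 1] - cell[r][c - 1]
        dx[r][0] = dx[r][1]              # left edge duplicates its neighbor
        dx[r][n - 1] = dx[r][n - 2]      # right edge duplicates its neighbor
    for c in range(n):
        for r in range(1, n - 1):
            dy[r][c] = cell[r + 1][c] - cell[r - 1][c]
        dy[0][c] = dy[1][c]              # top edge duplicates its neighbor
        dy[n - 1][c] = dy[n - 2][c]      # bottom edge duplicates its neighbor
    return dx, dy


# A synthetic 8x8 cell with a constant slope of 2 per column and 3 per row.
ramp = [[2 * c + 3 * r for c in range(8)] for r in range(8)]
dx, dy = cell_derivatives(ramp)
```

For the synthetic ramp, every pixel (including the borders) ends up with dx = 4 and dy = 6, which shows that no zero-padding artifact appears at the cell edges.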


2.3. Magnitude calculation using linear approach 

The original HOG method uses the Euclidean distance (L2-norm) of the derivative values (dx and dy) to
get the magnitude of each pixel. However, the L2-norm involves a square-root calculation [30],
[31], which is very complex to implement in hardware [32]. Therefore, this complicated
computation is replaced by an estimate. In this work, we use a linear
method (2) to calculate the magnitude instead of the L2-norm. Figure 4 compares the L2-norm
(X: 30, Y: 40) with the linear method (X: 30, Y: 42.43).


$$M(x, y) = \begin{cases} dx + \dfrac{dy}{3}, & \text{if } dx > dy \\[4pt] \dfrac{dx}{3} + dy, & \text{if } dx < dy \end{cases} \qquad (2)$$
Figure 4. The comparison of magnitude results between the L2-norm (red-colored track) and the linear method
(blue-colored track). The color version of this image is available in the online article version


As formulated in (2), the magnitude is denoted as M(x, y). Our linear method significantly
reduces computational complexity, as it uses only addition and division-by-constant operations. The division
by three can be implemented with simple shift and add operations, as described in (3).


$$\frac{a}{3} \approx (a \gg 2) + (a \gg 4) + (a \gg 6) \qquad (3)$$
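
The shift-and-add division and the linear magnitude can be tried out in software. The Python sketch below is only an illustration: it assumes that (2) adds one third of the smaller derivative to the larger one and that absolute values are taken first; the function names are hypothetical.

```python
def div3_shift_add(a):
    """Approximate a / 3 for a non-negative integer using (3): 1/4 + 1/16 + 1/64."""
    return (a >> 2) + (a >> 4) + (a >> 6)


def linear_magnitude(dx, dy):
    """Linear magnitude estimate: larger component plus one third of the smaller.

    Taking absolute values first is an assumption made for this sketch.
    """
    ax, ay = abs(dx), abs(dy)
    larger, smaller = max(ax, ay), min(ax, ay)
    return larger + div3_shift_add(smaller)


print(div3_shift_add(192))        # exact quotient would be 64
print(linear_magnitude(30, 12))
```

The three shifts sum to a factor of 0.328125 instead of 0.3333, so the quotient is slightly underestimated; for a = 192 the sketch returns 63 rather than 64.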


Figure 4 shows that this simplification delivers a satisfactory estimate
of the actual L2-norm. Three different approximations (for calculating magnitude, angle, and distance) are
used to avoid square roots and divisions in the HOG processing. In this work, we do not examine the
effects of these simplifications on accuracy; Figure 4 only compares the proposed linear method with the
L2-norm on a graph. The overall accuracy of the proposed system will be reported in future
work, where we will provide the error rate for a specific set of benchmarks.


2.4. Fixed weighted binning 

For the histogram function, the pixel angle can be calculated using arctangent and division
operations. However, computing the pixel angle with the arctangent function and division results in a very
complex computation that is not suitable for hardware implementation. Furthermore, since pixel
angles are computed for all pixels, there is a lot of data to be analyzed. This computation demands a very
large number of computation cycles and high latency. To cope with these requirements, we use a simplified
method that sets fixed bin regions using precomputed tangents (i.e., tangent 10°, 30°, 50°, 70°, 90°, 110°,
130°, 150°, and 170°). For example, tangent 10° = 0.1763269807 is approximated by
2^{-3} + 2^{-5} + 2^{-6}, which is 0.171875. Thus, we
have 9 bins with their approximated tangent values, as shown in Table 2.


Table 2. Tangent value approximations


Tangent        Approximated value
tan (10°)      2^{-3} + 2^{-5} + 2^{-6}
tan (30°)      2^{-1} + 2^{-4} + 2^{-6}
tan (...)      ...
tan (170°)     -2^{-3} - 2^{-5} - 2^{-6}
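
The quality of these power-of-two approximations can be checked numerically. The Python sketch below compares the tan(10°) value stated in the text, and the tan(30°) value as read from Table 2, against the exact tangents, and shows the corresponding shift-and-add multiplication on an integer. The function name `times_tan10` is hypothetical and the code is not the authors' implementation.

```python
import math

TAN10_APPROX = 2**-3 + 2**-5 + 2**-6     # 0.171875, value given in the text
TAN30_APPROX = 2**-1 + 2**-4 + 2**-6     # 0.578125, as read from Table 2


def times_tan10(d):
    """Multiply a non-negative integer by the tan(10 deg) approximation using shifts."""
    return (d >> 3) + (d >> 5) + (d >> 6)


err10 = abs(TAN10_APPROX - math.tan(math.radians(10)))
err30 = abs(TAN30_APPROX - math.tan(math.radians(30)))
print(round(err10, 5), round(err30, 5))
print(times_tan10(64))                   # 64 * 0.171875 = 11
```

Both approximations stay within 0.005 of the exact tangent, which suggests that three shifted terms are enough for deciding which angular region a pixel belongs to.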


The angle of pixel A(x, y) is computed from the derivative values dx and dy using the following
pseudocode:

Algorithm 1 Pixel angle pseudocode

i = 1;
θ_i = 10;
θ_(i+1) = 20;
while (true) {
    if (dx · tan(θ_i) < dy < dx · tan(θ_(i+1))) {
        A(x, y) = i + 1;
        exit(); }
    else {
        i = i + 1;
        θ_i = θ_(i+1);
        θ_(i+1) = θ_(i+1) + 10; } }
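
The search in Algorithm 1 avoids the arctangent by comparing dy against dx multiplied by successive tangents. The Python sketch below illustrates the same idea for the first quadrant only (dx > 0, dy >= 0). It uses `math.tan` in place of the shift-and-add constants and a zero-based sector index, both of which are simplifications assumed for this example; the function name `sector_index` is hypothetical.

```python
import math


def sector_index(dx, dy, step_deg=10):
    """Return k such that the gradient angle lies in [k*step, (k+1)*step) degrees.

    Only dx > 0 and dy >= 0 are handled; no arctangent or division is used,
    only comparisons of dy against dx * tan(theta).
    """
    k = 0
    theta = step_deg
    while theta < 90 and dy >= dx * math.tan(math.radians(theta)):
        k += 1
        theta += step_deg
    return k


print(sector_index(10, 0))     # angle of 0 degrees, first sector
print(sector_index(10, 10))    # angle of 45 degrees, sector covering 40 to 50
print(sector_index(1, 100))    # angle close to 90 degrees, last sector
```

In hardware, each product dx · tan(θ) would be replaced by the shift-and-add form of the corresponding tangent constant, so the loop reduces to a chain of comparators.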


The tangent multiplication can be approximated with bit shifting and addition operations in the
hardware implementation. Given the computed pixel angle A(x, y), we can calculate the histogram
using the rules described in Table 3. If the pixel angle A(x, y) lies on a bin center, then the magnitude
M(x, y) is stored entirely in the respective bin. For example, if A(4,4) = 10° and M(4,4) = 1.5,
then bin #1 = 1.5. On the other hand, if the pixel angle A(x, y) lies on a bin boundary, then the magnitude
M(x, y) is split equally between both neighboring bins. For example, if A(5,5) = 20° and M(5,5) = 4, then
bin #1 = 2 and bin #2 = 2. This scheme has also been used by [33]-[35].


Table 3. Bin value rules


Angle    Bin center?    Target bin    Weight
0°       No             #1 and #9     M (x, y)/2
10°      Yes            #1            M (x, y)
20°      No             #1 and #2     M (x, y)/2
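
The rules in Table 3 can be expressed as a small voting function. The Python sketch below assumes that the pattern of the three listed rows continues for the remaining angles (bin centers at 10°, 30°, ..., 170° and boundaries at 0°, 20°, ..., 160°), which is an extrapolation made for illustration; the function name `vote` is hypothetical.

```python
def vote(hist, angle_deg, magnitude):
    """Add a pixel's magnitude to a 9-bin histogram (index 0 holds bin #1).

    angle_deg is a multiple of 10 in the range 0..170.
    """
    if angle_deg % 20 == 10:
        # Bin center: the whole magnitude goes to one bin.
        hist[angle_deg // 20] += magnitude
    else:
        # Bin boundary: the magnitude is split equally between two bins.
        upper = (angle_deg // 20) % 9
        lower = (upper - 1) % 9          # 0 degrees wraps around to bin #9
        hist[lower] += magnitude / 2
        hist[upper] += magnitude / 2


hist = [0.0] * 9
vote(hist, 10, 1.5)   # center of bin #1
vote(hist, 20, 4)     # boundary between bin #1 and bin #2
vote(hist, 0, 2)      # boundary between bin #9 and bin #1
print(hist)
```

Because the only weights are 1 and 1/2, the split can be implemented with a single right shift instead of the interpolation multipliers used in the original HOG voting.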


2.5. Block normalization using Newton-Raphson method

Several normalization methods can be employed to normalize the histogram, such as the
L2-norm and the Manhattan distance (L1-norm) [36]-[38]. In this case, the L1-norm is more suitable for hardware
implementation because, unlike the L2-norm, it does not use square-root operations, even though further
simplification is still required. Vector normalization is obtained using (4), where L1-sum is the Manhattan
distance sum.


$$V_{norm} = \frac{v}{|v|_{L1\text{-}sum}} \qquad (4)$$


Since $V_{norm} = v \times d$, the distance d is stated as (5),


$$d = \frac{1}{|v|_{L1\text{-}sum}} \qquad (5)$$


To calculate d, the Newton-Raphson approximation is used as in (8). It is derived from (6) and (7), which are the
formulas for x_0 and x_1 in the Newton-Raphson digital blocks.


$$x_0 = (3 \ll n) - (2 \times sum) \qquad (6)$$
$$x_1 = x_0 \times [(2 \ll 2n) - (sum \times x_0)] \qquad (7)$$
$$[\text{representation}] \quad d \ll 12 = x_1 \gg (4n - 12) \qquad (8)$$


Here n is defined as n = MSB(sum), the position of the most significant set bit of sum. For instance, if
sum = 13 (binary 1101), then n = 4. The result is delivered as a fractional number.
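
Equations (6) to (8) can be followed step by step in software. The Python sketch below is a minimal illustration assuming that n equals the bit length of sum (so that sum = 13 gives n = 4, as in the example above) and that a left shift is used when 4n - 12 is negative; the function name `reciprocal_q12` is hypothetical.

```python
def reciprocal_q12(total):
    """Approximate (1 / total) << 12 with one Newton-Raphson step, following (6)-(8)."""
    n = total.bit_length()                        # sum = 13 (0b1101) gives n = 4
    x0 = (3 << n) - (2 * total)                   # (6): initial estimate, scaled by 2^n
    x1 = x0 * ((2 << (2 * n)) - (total * x0))     # (7): one refinement, scaled by 2^(3n)
    shift = 4 * n - 12                            # (8): rescale to 12 fractional bits
    return x1 >> shift if shift >= 0 else x1 << -shift


print(reciprocal_q12(13))    # exact value of 4096 / 13 is about 315.1
print(reciprocal_q12(4))     # exact value of 4096 / 4 is 1024
```

For sum = 13 the intermediate values are x_0 = 22 and x_1 = 4,972, giving 310 against an exact value of about 315, an error of roughly 1.6%. This is consistent with a single Newton-Raphson iteration started from a linear initial estimate.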


2.5.1. Blockwise SVM classification 

The idea of this method is to multiply by the SVM coefficients block-wise instead of per window.
However, it is important to note that block #1 corresponds only to window #1, whereas block #2 corresponds to
both window #1 and window #2. Thus, block #2 is used for the SVM classification of both window #1 and
window #2. This also applies to other blocks that correspond to multiple windows. Section 3 (results and
discussion) further explains the hardware design, which takes advantage of a pipelined architecture to handle
these complicated calculations. The SVM coefficients are trained with the simplified algorithm. We used the
libsvm library to train our SVM with the Massachusetts Institute of Technology (MIT) pedestrian dataset. Then,
we examined the linear SVM and retrained it on the false positives. The classifier was trained with 924
positive images and 13,680 negative images.
2.5.2. Fixed-point representation

Fixed-point is used to represent the fractional data. The data-width of all the modules is depicted in 
Table 4; it contains input pixel, derivatives, magnitude, and so on. The bit-width is determined by searching 
for the shortest bit-width in each module that will not cause any bit-overflow or interfere with the calculation 
results. 


Table 4. Functional module bit-width optimization


Module             Sign        Bit-width    Data type
Input pixel        Unsigned    8            Integer
Derivatives        Signed      9            Integer
Magnitude          Unsigned    9            Integer
Histogram bin      Unsigned    15           Integer
Normalized         Unsigned    12           Fraction, << 12
SVM coefficient    Signed      14           Fraction, << 12
Window score       Signed      32           Integer
Detection          Unsigned    1            Integer
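
The "Fraction, << 12" entries in Table 4 mean that a real value is stored as an integer scaled by 2^12. The Python sketch below shows this encoding and an integer multiply-accumulate of normalized features with SVM coefficients. It is only an illustration; the helper names `to_q12` and `dot_q12` are hypothetical, and the sample values are made up.

```python
FRAC_BITS = 12                                   # scale factor 2^12, as in Table 4


def to_q12(value):
    """Encode a real number as an integer with 12 fractional bits."""
    return int(round(value * (1 << FRAC_BITS)))


def dot_q12(features_q12, coeffs_q12):
    """Integer multiply-accumulate; each product carries 24 fractional bits."""
    acc = 0
    for f, c in zip(features_q12, coeffs_q12):
        acc += f * c
    return acc


features = [to_q12(0.5), to_q12(0.25)]           # made-up normalized bin values
coeffs = [to_q12(1.0), to_q12(-0.5)]             # made-up SVM coefficients
score = dot_q12(features, coeffs)
print(score, score / (1 << 2 * FRAC_BITS))       # integer score and its real value
```

The accumulated score corresponds to 0.5 × 1.0 + 0.25 × (-0.5) = 0.375, and only its sign is needed for the final person or non-person decision.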


3. RESULTS AND DISCUSSION 
3.1. Hardware architecture implementation 

Our system block diagram is shown in Figure 5. The input of the system is a 640×480 image. The
output is a grayscale image on an external display, in which the regions
believed to contain human figures are marked. The system comprises a control unit and modules for the
derivatives, gradient binning using the linear approach, cell grouping, histogram normalization using the
L1-norm, and sliding window and SVM classification.


[Figure 5 block labels: control unit, VGA controller, display module (LCD), Altera DE2-115, RAM to store cells
(4 M9K), RAM to store detection result (1 M9K block), derivatives module, gradient binning (linear approach),
cell grouping, histogram normalization (L1-norm)]


Figure 5. Block diagram of the proposed system with 29 M9K frame buffer RAM


To increase the processing speed and enable the system to work in real time, we applied a
pipeline architecture to our system, as reported in Table 5. M9K is the memory block of the Altera FPGA on the
DE2-115 board. The pipeline architecture also enables us to reduce the memory bandwidth, as it does not require
all cell and block values to be stored, but only the blocks that correspond to the windows being processed at that
time. Consequently, we are able to use embedded (internal) RAM as the processing storage instead of external
RAM, which translates to a significant reduction in pin utilization and power consumption. Table 5 shows the
benefits introduced by the pipeline mode in the proposed hardware, as obtained from the synthesis process. The
pipeline architecture enables us to reduce the clock cycle latency dramatically. The overall process without
pipelining takes 13,112 cycles, which is the sum of three digital blocks (i.e., the cell grouping, block
normalization, and SVM classification processes). Note, however, that pipelining does not improve the
latency of the individual SVM classification, block normalization, and cell grouping modules.
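
The cycle counts quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below uses the per-module values listed in Table 5; the dictionary and variable names are hypothetical.

```python
# Per-module cycle counts as listed in Table 5.
module_cycles = {
    "cell_grouping": 4800,
    "histogram_normalization": 4719,
    "svm_classification": 3593,
}

sequential_cycles = sum(module_cycles.values())   # modules executed one after another
pipelined_cycles = 4888                           # overall value reported with pipelining

speedup = sequential_cycles / pipelined_cycles
print(sequential_cycles, round(speedup, 2))
```

The three modules add up to the reported 13,112 cycles, and the pipelined total of 4,888 cycles corresponds to a reduction by a factor of about 2.68, only slightly above the latency of the longest single module.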

The control unit module generates the read and write addresses for the embedded RAM used in each
module. This module plays an important role in our pipeline architecture, since the RAM access scheduling
must be accurate at all times. The embedded RAMs used and their memory sizes are listed in Table 6.

The derivative module calculates the cell-based derivatives (dx and dy) of the image; this block
contains the pixel derivatives shown in Figure 6 and the anti-aliasing filter shown in Figure 3. The anti-aliasing
filter is applied to the pixels at the image edges to overcome the zero-padding convolution problem. We
designed a highly parallel architecture, as shown in Figure 7. Our architecture calculates the
derivatives and gradient bins for 64 pixels simultaneously, which significantly reduces
clock latency.


Table 5. Reduced cycle count by pipeline architecture

Blocks                     Without pipelining    With pipelining
Cell grouping              4,800                 4,800
Histogram normalization    4,719                 4,719
SVM classification         3,593                 3,593
Overall process            13,112                4,888


Table 6. Embedded RAMs

Embedded RAM                  Size               Data-width
Frame buffer (input image)    29 M9K blocks      8 bit
Cell grouping                 4 M9K blocks       8 bit
Sliding window                108 M9K blocks     8 bit
Store detection result        1 M9K block        1 bit


Figure 6. Pixel derivatives

Figure 7. Cell-based parallel architecture


The gradient binning module consists of a rotator, a magnitude calculator, and a binning unit, as shown in
Figure 8. The rotator unit comprises four digital blocks (i.e., inverse, bit extender, comparator, and multiplexer
units), as shown in Figure 9. The inverse unit negates a number using two's complement
notation. The bit extender represents a number with more bits without altering its value. The dx and dy
representation uses 12 bits instead of 9 bits to avoid overflow, as the magnitude calculator and binning unit
involve shifting and adding the values of dx and dy. Comparators and multiplexers are used to determine the
quadrant of dx and dy. The magnitude calculator performs the approximations in (2) and (3) using four digital
blocks (i.e., multiplexer, right shifter, comparator, and adder), as shown in Figure 10. Finally, all the magnitudes
are grouped and summed into the respective bins based on the rules specified in Table 3. The output of this
module is the cell histogram data, consisting of 9 bins × 15 bits concatenated into a single line, as shown in
Figure 11.
Figure 8. Gradient binning diagram

Figure 9. Rotator architecture

Figure 10. Magnitude calculator architecture


The next module groups cells into blocks and performs block normalization. To
group the cells in raster-scan sequence, we use a line delay, as shown in Figure 12. It is implemented using
RAM to delay 80 cells of data, which is the row size of the input image, as shown in Figure 1. Each cell
consists of 9 bins × 15 bits of data. Using this configuration, we can group the four cells of data that comprise a
single block. For example, when cell #82 is fed to the module, block #1, which consists of cells #1, #2,
#81, and #82, is constructed. The block data is then normalized using combinational circuits.

To minimize memory resource usage, we use only a line delay RAM with a size of 80
cells to process all 4,800 cells that are grouped into blocks. The first 80 cells are stored initially in the
RAM. When cell #81 is fed into the module, the module reads cell #1 from the RAM and overwrites
cell #1 with cell #81 in the RAM. Block #1 is fully constructed when cell #82 is fed to the module.
There are 82 clock cycles of latency before block processing starts: 1 clock cycle for the register at the line
delay input, 1 clock cycle for the register at the output, and 80 clock cycles to fill the line delay RAM initially.
After the module has been provided with the first 82 cells of data, it takes 3 clock cycles to generate a block. The
complete block diagram is depicted in Figure 13.
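
The line-delay grouping can be modeled in software to see which cells end up in each block. The Python sketch below is a behavioral illustration only: it ignores register and clock latencies, and the function name `stream_blocks` is hypothetical.

```python
def stream_blocks(cells, row_len=80):
    """Group a raster-scan stream of cells into 2x2 blocks using one row of storage.

    Returns tuples (top_left, top_right, bottom_left, bottom_right).
    """
    ram = [None] * row_len                  # line delay: holds the previous cell row
    prev_cell = prev_delayed = None
    blocks = []
    for index, cell in enumerate(cells):
        addr = index % row_len
        delayed = ram[addr]                 # cell that arrived row_len cells earlier
        ram[addr] = cell                    # overwrite it with the newest cell
        if delayed is not None and addr > 0:
            blocks.append((prev_delayed, delayed, prev_cell, cell))
        prev_cell, prev_delayed = cell, delayed
    return blocks


blocks = stream_blocks(range(1, 4801))      # cells numbered 1..4800
print(blocks[0], len(blocks))
```

The first block produced is (1, 2, 81, 82), in agreement with the example in the text, and the stream yields 4,661 blocks for the full frame while storing only 80 cells at any time.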


Figure 11. Histogram binning architecture

Figure 12. Cell grouping architecture


As stated before, our architecture avoids any division operation, since it is too complex for
hardware implementation. Each block is normalized using the L1-norm to simplify the calculation. The
module contains an L1-sum unit to find the Manhattan distance of 9 bin vectors × 4 cells (36 bin vectors in total).
The Newton-Raphson algorithm is then used to approximate d as in (5). This module uses two multipliers, two
subtractors, and bit shifters. Finally, the computed vectors are multiplied by d and then concatenated into a
single line of data, as in Figure 14. This module consumes 1 clock cycle. The results are then stored in the RAM.

The sliding window works in pace with the block-wise SVM classification. We designed a highly
parallel and pipelined architecture that is able to calculate 7 windows simultaneously. The SVM
classification is done column-wise, because several columns are used to calculate more than one window.
For example, column #1, which consists of blocks #1, #81, #161, ..., and #1121, is used to calculate window #1.
Column #2, however, which consists of blocks #2, #82, #162, ..., and #1122, is used to calculate both window #1
and #2, and so on. After the 7 columns of the respective window, consisting of 105 blocks, have been calculated,
the SVM bias is subtracted from the score. The comparator then decides whether a person is present
using the value of the sign bit. After a window has been calculated, the system continues with a new window
until the whole image has been inspected. The hardware architecture is shown in Figure 15.
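
The sharing of block columns between neighboring windows can be made explicit with a small helper. The Python sketch below uses zero-based column indices and assumes 79 block columns and 73 horizontal window positions per row of windows (derived from the 80-cell row and the 8-cell window width); the function name `windows_for_block_column` is hypothetical.

```python
def windows_for_block_column(col, window_cols=73, blocks_per_window_row=7):
    """Return the zero-based window columns that use block column `col`.

    A window starting at block column w covers columns w .. w + 6.
    """
    first = max(0, col - (blocks_per_window_row - 1))
    last = min(col, window_cols - 1)
    return list(range(first, last + 1))


print(windows_for_block_column(0))    # only the first window
print(windows_for_block_column(1))    # shared by the first two windows
print(windows_for_block_column(6))    # shared by seven windows
```

Away from the image borders, every block column contributes to seven windows, which is why accumulating seven window scores in parallel lets each block be read only once.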
Figure 13. Architecture of cell grouping hardware design


Figure 14. Histogram normalization architecture

Figure 15. Window sliding and SVM classification

The final module of this design is the display module. Together with the video graphics array
(VGA) controller, this module generates the read addresses for the frame buffer RAM (input image) and the
detection RAM (HOG results). Note that there are no special requirements for the display module, because
its only function is to draw a detection box over the detected object. It uses 640×480 pixels with a 25
MHz VGA clock. Both RAMs are read using the same clock. Counting the pixels together with the front
and back porch, the VGA counts to 800×600. Therefore, it needs (800×600)/25,000,000 seconds to
complete one frame, which is around 52 Fps. In summary, the design must operate at a minimum of 52 Fps in
order to fit the VGA configuration. The display module also generates markings on windows that are
considered to contain human figures.
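
The refresh-rate figure follows from the pixel clock and the total pixel count per frame. The Python sketch below simply repeats that arithmetic with the numbers stated above; the variable names are hypothetical.

```python
VGA_CLOCK_HZ = 25_000_000            # VGA pixel clock stated in the text
TOTAL_PIXELS_PER_FRAME = 800 * 600   # active area plus porches, as stated in the text

frame_time_s = TOTAL_PIXELS_PER_FRAME / VGA_CLOCK_HZ
refresh_fps = 1 / frame_time_s
print(frame_time_s, round(refresh_fps, 2))
```

One frame takes 19.2 ms, which gives about 52.08 Fps and therefore sets the minimum detection rate the design has to sustain.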


3.1.1. Performance implementation 

To evaluate the effectiveness of our architecture, we implemented our design on an
FPGA. We used an Altera DE2-115 board (Cyclone IV EP4CE115 FPGA chip). The board is connected to an
external VGA display to show the resulting image. We tested several 640×480 pixel color images to verify
the system functionality. For static detection, we chose images with relatively small pedestrian
images of 128×64 pixels instead of the full 640×480 pixel image. The image in Figure 16(a) gives the
best pedestrian detection; the pedestrians are quite similar to our positive training dataset. In Figure 16(b),
the image has various lighting conditions. However, since HOG uses the gradient feature, our detector can
still detect the pedestrians and does not interpret the shadows as humans. On the other hand, in Figure 16(c),
the HOG detector may not be able to detect various poses reliably, as their gradients vary. To increase the
detector performance, images containing various poses should be added to our training dataset.


Figure 16. Detection results on pedestrians: (a) under uniform lighting, (b) under various lighting conditions,
and (c) with various poses


Figure 17(a) shows the FPGA displaying the detection result on a monitor. The marks drawn on
humans indicate successful detections. Our architecture is coded in the Verilog hardware description language
(HDL). We use a top-down approach to design the system architecture and a bottom-up approach to code the
hardware modules. Each sub-module is designed and tested before being integrated. Figure 17(b) shows the
flow summary of the design analysis and synthesis result. The design consumes 48,360 logic elements (LEs),
4,363 registers, and 84 9-bit embedded multipliers. It consumes only 0.141 Mbits of memory. Our
architecture requires 4,888 clock cycles to complete the detection for one frame. It also needs 640 × 480 =
307,200 clock cycles to receive the image data and store them in the frame buffer. Therefore, the overall
system needs 312,088 clock cycles (calculated from 307,200 + 4,888) to process one frame of image data. It is
important to note that the frame buffer embedded RAM is a major speed bottleneck, as it is only able to deliver
one pixel every clock cycle.

The TimeQuest Timing Analyzer shows that the maximum operating frequency allowed for the
design is 28.62 MHz. By setting the system clock frequency to 27 MHz, we obtained 86.51 Fps (obtained
from 27,000,000/312,088). Since the VGA has a 52 Fps refresh rate, our design is suitable for use with
VGA, because a frame output is always available every time the VGA refreshes. However, the Fps may
be improved significantly by reducing the latency of the frame buffer, because most of the clock cycles are
consumed by loading the input buffer. The actual processing unit (without the frame buffer) is able to deliver
5,523.732 Fps (obtained from 27,000,000/4,888). This indicates that the frame buffer latency severely
overshadows the actual capability of the processing unit. It may be possible to reduce resources in particular
modules and keep the algorithm consistent at a smaller throughput.
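
The frame rates above follow directly from the clock frequency and the cycle counts. The Python sketch below repeats the calculation with the stated values; the function name `frames_per_second` is hypothetical.

```python
def frames_per_second(clock_hz, cycles_per_frame):
    """Frame rate obtained when one frame takes `cycles_per_frame` clock cycles."""
    return clock_hz / cycles_per_frame


CLOCK_HZ = 27_000_000
LOAD_CYCLES = 640 * 480          # one pixel per cycle into the frame buffer
PROCESS_CYCLES = 4888            # pipelined HOG and SVM processing

system_fps = frames_per_second(CLOCK_HZ, LOAD_CYCLES + PROCESS_CYCLES)
core_fps = frames_per_second(CLOCK_HZ, PROCESS_CYCLES)
print(round(system_fps, 2), round(core_fps, 3))
```

Loading the frame buffer accounts for 307,200 of the 312,088 cycles per frame, about 98%, which is why the processing core alone would run roughly 64 times faster than the complete system.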
(a) Detection markings

Flow Status                  Successful - Fri Jan 27 03:59:29 2017
Quartus Prime Version        16.1.0 Build 196 10/24/2016 SJ Lite Edition
Revision Name                onesize
Top-level Entity Name        onewindow
Family                       Cyclone IV E
Device                       EP4CE115F29C7
Timing Models                Final
Total pins                   9 / 529 (2%)
Total virtual pins           0
Total memory bits            141,872 / 3,981,312 (4%)
Total PLLs                   0 / 4 (0%)
(b)


Figure 17. Performance implementation of the proposed system: (a) experimental setup using Altera DE2-115
FPGA and an external monitor and (b) screenshot of synthesis summaries


As shown in Figure 15, the component regarded as a first-in-first-out (FIFO) function can be replaced
by a static random-access memory (SRAM), so that gate count and power can be saved. A FIFO toggles
continuously and therefore consumes power, whereas an SRAM activates only one cell to access data. Moreover,
ping-pong mode enables an SRAM to perform "read" and "write" concurrently. The detection success of the
proposed algorithm will be compared with the software implementation of the original HOG algorithm in
future work. This remains an open challenge that should be explored.
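The ping-pong mode mentioned above can be illustrated with a small behavioural model. This is a software sketch under stated assumptions, not the authors' hardware: the class name `PingPongBuffer`, the bank depth, and the pixel values are all hypothetical. Two banks are used so that one bank is written while the other is read, and the roles are exchanged once a bank has been filled.

```python
class PingPongBuffer:
    """Behavioural model of a two-bank (ping-pong) memory.

    One bank accepts writes while the other serves reads; swap() exchanges
    the roles. Names and sizes are illustrative assumptions.
    """

    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]
        self.write_bank = 0  # index of the bank currently being written

    def write(self, addr, value):
        self.banks[self.write_bank][addr] = value

    def read(self, addr):
        # Reads always come from the bank that is not being written.
        return self.banks[1 - self.write_bank][addr]

    def swap(self):
        self.write_bank = 1 - self.write_bank


buf = PingPongBuffer(depth=4)

# Fill the first bank with one group of (hypothetical) pixel values.
for addr, pixel in enumerate([10, 20, 30, 40]):
    buf.write(addr, pixel)
buf.swap()

# Write the next group while reading the previous one back in the same step.
read_back = []
for addr, pixel in enumerate([50, 60, 70, 80]):
    buf.write(addr, pixel)
    read_back.append(buf.read(addr))

print(read_back)
```

In each loop iteration the model performs one write and one read on different banks, which is the property that lets a ping-pong memory sustain concurrent "read" and "write" accesses.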


3.1.2. Performance comparison 

Table 7 and Table 8 compare the performance results with those of other works. The strongest points
of our design relative to the others are the frame rate (Fps) and the Fps-to-clock ratio. All the competitors
use standard FPGA boards. Compared to the earlier works, our implementation performs significantly better in
terms of the ratio of delivered Fps to operating clock frequency. This is made possible by our pipeline
architecture and custom-designed hardware modules, instead of general-purpose processors. Processors may be
smaller in size, but dedicated modules are more efficient. Moreover, not all cells and blocks are kept in the
RAM concurrently; unused data are overwritten to minimize memory usage. Additionally, the low operating
clock frequency translates to lower power consumption.

Operating at a low frequency allows power consumption to be reduced. For instance, in
Table 7, [23] has a much higher image resolution (1920×1080), a higher operating clock (270 MHz), and a
lower frame rate (64 Fps). In contrast, the proposed work has a lower image resolution (640×480), a ten times
lower operating clock (27 MHz), and a higher frame rate (86.51 Fps). This is the architecture design
trade-off. In Table 8, [39] has the highest frame rate (162 Fps), but its operating clock is also the highest
(150 MHz), resulting in a lower Fps-to-clock ratio. In contrast, the proposed work has the highest
Fps-to-clock ratio (3.2041), a lower frame rate, a lower operating clock (27 MHz), more efficient memory
usage (141,872 bits), and the fewest registers (4,363). With the same image resolution (640×480), [40] has the
fewest embedded multipliers (40 DSP blocks) as well as the lowest operating frequency (25 MHz), which our
proposed architecture does not match. However, our architecture has a higher Fps and Fps-to-clock ratio and
is also resource-efficient. In summary, our architecture offers the best trade-off compared to previous works
in terms of operating clock frequency (MHz), Fps (Hz), Fps-to-clock ratio, and the use of registers.
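The Fps-to-clock ratio used in this comparison is simply the delivered frame rate divided by the operating clock frequency in MHz. The sketch below recomputes it from the frame rates and clock frequencies listed in Table 7; the dictionary layout and names are illustrative assumptions, and only the numbers come from the table.

```python
# (frame rate in Fps, operating clock in MHz) as listed in Table 7.
# The data structure and names are illustrative only.
works = {
    "[20]": (38.0, 167.0),
    "[21]": (62.0, 44.0),
    "[23]": (64.0, 270.0),
    "[41]": (60.0, 25.0),
    "This work": (86.51, 27.0),
}

# Fps-to-clock ratio: frames delivered per second for each MHz of clock.
ratios = {name: fps / clock_mhz for name, (fps, clock_mhz) in works.items()}

for name, ratio in ratios.items():
    print(f"{name:>9}: {ratio:.4f}")

best = max(ratios, key=ratios.get)
print("highest Fps-to-clock ratio:", best)
```

A higher ratio means that more frames are produced per clock cycle, so the metric rewards architectural efficiency rather than raw clock speed.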


Table 7. Performance comparison against other works using Xilinx Virtex FPGAs


Parameters  [20]  [21]  [23]  [41]  This work
FPGA board  Virtex-5  Virtex-5  Virtex-5  Virtex-6  Cyclone IV
LUTs**  28,495  17,383  5,188 (↑)  113,359 (↓)  48,360
Registers  5,980  2,181 (↑)  5,176  75,071 (↓)  4,363
Embedded multipliers / DSP block***  2 (↑)  N/A*  49  72  84 (↓)
Memory usage (bits)  2,196,000  1,327,000  1,188,000  4,284,000 (↓)  141,872 (↑)
Operating clock frequency (MHz)  167  44  270 (↓)  25 (↑)  27
Frame per second (Hz)  38 (↓)  62  64  60  86.51 (↑)
Fps-to-clock ratio  0.2275 (↓)  1.4091  0.237  2.4  3.2041 (↑)
Image resolution  320×240 (↓)  1920×1080 (↑)  640×480  640×480


* N/A: not available.

(↑): The highest value in the comparison table among the equivalent parameters compared

(↓): The lowest value in the comparison table among the equivalent parameters compared

** LUTs in FPGA logic building blocks may not serve as an accurate parameter for design size comparison. LUTs in Cyclone devices
have 4 inputs, while those in Xilinx Virtex-5 have 6 inputs. Moreover, logic building blocks differ among devices, as in Altera's logic
element (LE) and Xilinx's logic cell (LC), each with their respective hardware design. These factors may cause different synthesis
results in Altera and Xilinx FPGAs.

*** Embedded multiplier refers to the 9×9 multipliers in Altera Cyclone III and IV. DSP block refers to the DSP48E slice in Xilinx
Virtex-5, which is equipped with a 25×18 multiplier.


Table 8. Performance comparison against other works using Altera Cyclone FPGAs


Parameters  [39]  [42]  [43]  [40]  [44]  [45]  This work
FPGA board  Cyclone IV  Cyclone IV  Cyclone IV  Cyclone  Cyclone II  Cyclone V  Cyclone IV
LUTs**  16,060  83,497 (↓)  34,403  17,419  14,895  11,156 (↑)  48,360
Registers  7,220  17,383 (↓)  23,247  11,306  9,800  13,191  4,363 (↑)
Embedded multipliers / DSP block***  69  90 (↓)  68  N/A*  40 (↑)  N/A*  84
Memory usage (bits)  334,000  2,800,000 (↓)  348,000  1,046,647  280,000  2,137 (↑)  141,872
Operating clock frequency (MHz)  150 (↓)  50  40  70  25 (↑)  76  27
Frame per second (Hz)  162 (↑)  129  72  20  48  8 (↓)  86.51
Fps-to-clock ratio  1.08  2.58  1.8  0.2857  1.92  0.1053 (↓)  3.2041 (↑)
Image resolution  800×600  1280×1024 (↑)  800×600  640×480 (↓)


(↑): The highest value in the comparison table among the equivalent parameters compared
(↓): The lowest value in the comparison table among the equivalent parameters compared


4. CONCLUSION AND FUTURE WORKS 

This paper presents a hardware architecture design to implement a simplified HOG algorithm. We
have designed a cell-based raster-scanning computation instead of a window-based one to reduce computation
redundancy. The magnitude calculation using a linear approach provides a reasonable approximation
of the magnitude without using exponentiation and square root operations, because the L2-norm approach is
too difficult to implement in hardware. By using fixed-weighted binning for histogram
classification, we can avoid using arctangent and division operations. Furthermore, by using the
Newton-Raphson algorithm, we can execute block normalization without using any division operations. Finally,
the overall parallel and pipeline architecture gives accurate detection with less memory usage and the highest
Fps-to-clock frequency ratio among the compared works. This work used the MIT pedestrian dataset for
training. The primary feature of this work is the simplification of HOG for an efficient hardware
implementation. This simplification inevitably causes some degradation relative to the performance of the
original HOG. It is important to examine in detail the performance of the original HOG compared to this work
(simplified HOG). We will also address several interesting issues, e.g., the impact of the computational
reduction on detection accuracy, the throughput achieved by GPU implementations, and a more objective
figure of merit (FOM). These will be considered in order to demonstrate that the proposed hardware
architecture is more efficient than the other competitors. Later, the effect of accuracy improvement on
computational costs and system complexity will be evaluated further. In recent years, various other
challenging datasets have been introduced by many researchers. Therefore, we will use various publicly
available datasets and datasets produced by ourselves to train and evaluate our proposed system
comprehensively.
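To make the division-free normalization idea concrete, the following sketch shows how a Newton-Raphson iteration can approximate a reciprocal using only multiplications, subtractions, and power-of-two scaling. It is a generic illustration under stated assumptions, not the circuit described in this paper: the function names, the initial estimate, the iteration count, and the L1-style norm used in `normalize_block` are all assumptions.

```python
import math

# Precomputed constants 48/17 and 32/17 for a linear initial estimate of 1/d
# on the interval [0.5, 1). They are stored as literals so that no division
# is needed at run time.
C0 = 2.823529411764706
C1 = 1.8823529411764706


def reciprocal_newton(d, iterations=4):
    """Approximate 1/d for d > 0 without a division operation."""
    if d <= 0:
        raise ValueError("d must be positive")
    # Scale d into [0.5, 1) by powers of two (bit shifts in hardware).
    shift = 0
    while d >= 1.0:
        d *= 0.5
        shift += 1
    while d < 0.5:
        d *= 2.0
        shift -= 1
    x = C0 - C1 * d                 # initial estimate of 1/d
    for _ in range(iterations):
        x = x * (2.0 - d * x)       # Newton-Raphson step for f(x) = 1/x - d
    return math.ldexp(x, -shift)    # undo the power-of-two scaling


def normalize_block(histogram, epsilon=1e-3):
    """Normalize a block histogram by multiplying with a reciprocal norm.

    The L1-style norm (sum of bins plus epsilon) is an assumption made for
    this sketch.
    """
    inverse_norm = reciprocal_newton(sum(histogram) + epsilon)
    return [value * inverse_norm for value in histogram]


print(reciprocal_newton(3.0))
print(normalize_block([4.0, 2.0, 1.0, 1.0], epsilon=0.0))
```

Each iteration roughly squares the relative error of the estimate, so a handful of multiply-subtract steps is enough to replace a hardware divider.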